BMGSJoin:一种基于MapReduce的图相似度连接算法<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201505011

摘要
图/表
参考文献
相关文章 (3)

全文: PDF (551 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要图相似度连接在数据挖掘领域应用广泛，尤其是在数据预处理阶段，可用于数据清理、近复本检测等，其研究具有十分重要的意义.针对基于编辑距离约束的图相似度连接问题进行研究，返回两个图集合中所有编辑距离不超过给定阈值的图对.基于分布式编程框架MapReduce，设计采用“过滤-验证”框架的MGSJoin算法，利用基于路径的q-gram签名实现非解候选对的过滤，计数过滤.鉴于该算法键值对数量庞大的潜在问题，引入Bloom Filter技术对算法进行改进并设计BMGSJoin算法.实验结果表明，提出的两种图相似度连接算法能较大地改善现有算法的效率和可扩展性，并能较好地应对当前大数据挖掘分析的需求.

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	陈一帆
	赵翔
	何培俊
	张维明
	唐九阳

关键词 ：图相似度连接, MapReduce, 布隆过滤器

Abstract：Graph similarity join has extensive use in the field of data mining, especially in data pre-processing, it could be applied to data cleaning, near duplicate detection, etc. Thus, it is of great importance to study graph similarity join. Graph similarit join based on edit distance constraints is studied, that is, all the edit distances in the return pair of graphs are no larger than a given threshold. Based on MapReduce programming model, an algorithm named MGSJoin is proposed with the ″filtering-verification″ framework, and it relies on graph signatures of path-based q-grams for filtering out non-promising candidates, i.e. count filtering.With the potential issue of too many key-value pairs, Bloom Filter is introduced to improve the algorithm and BMGSJoin is designed. The improvement of efficiency and scalability by the proposed algorithm is demonstrated by extensive experimental results, and it may meet the current challenges of big data mining and analysis.

Key words： Graph Similarity Join MapReduce Bloom Filter

收稿日期: 2014-03-25

ZTFLH:

TP 391.4

基金资助:国家自然科学基金项目(No.61402494)、湖南省自然科学基金项目(No.2015JJ4009)资助

作者简介: 陈一帆，男，1991年生，硕士研究生，主要研究方向为信息资源管理.赵翔(通讯作者)，男，1986年生，博士，讲师，主要研究方向为图数据管理及社会计算.E-mail:xiangzhao@nudt.edu.cn.何培俊，男，1988年生，硕士研究生，主要研究方向为信息资源管理.张维明，男，1962年生，博士，教授，主要研究方向为军事信息系统与智能决策.唐九阳，男，1978年生，博士，副教授，主要研究方向为对等计算、传感器网络及社会计算.

引用本文:

陈一帆，赵翔，何培俊，张维明，唐九阳. BMGSJoin:一种基于MapReduce的图相似度连接算法^*[J]. 模式识别与人工智能, 2015, 28(5): 472-480. CHEN Yi-Fan, ZHAO Xiang, HE Pei-Jun, ZHANG Wei-Ming, TANG Jiu-Yang. BMGSJoin: A MapReduce Based Graph Similarity Join Algorithm. , 2015, 28(5): 472-480.

链接本文:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201505011 或 http://manu46.magtech.com.cn/Jweb_prai/CN/Y2015/V28/I5/472

[1] Yan X F, Han J W. gSpan: Graph-Based Substructure Pattern Mining // Proc of the IEEE International Conference on Data Mining. Maebashi, Japan, 2002: 721-724
[2] Yan X F, Yu P S, Han J W. Graph Indexing: A Frequent Structure-Based Approach // Proc of the ACM SIGMOD International Conference on Management of Data. Paris, France, 2004: 335-346
[3] He H H, Singh A K. Closure-Tree: An Index Structure for Graph Queries // Proc of the 22nd International Conference on Data Engineering. Atlanta, USA, 2006: 38.1-38.12
[4] Williams D W, Huan J, Wang W. Graph Database Indexing Using Structured Graph Decomposition // Proc of the 23rd International Conference on Data Engineering. Istanbul, Turkey, 2007: 976-985
[5] Tian Y Y, Patel J M. Tale: A Tool for Approximate Large Graph Matching // Proc of the 24th IEEE International Conference on Data Engineering. Cancun, Mexico, 2008: 963-972
[6] Yan X F, Yu P S, Han J W. Substructure Similarity Search in Graph Databases // Proc of the ACM SIGMOD International Conference on Management of Data. Baltimore, USA, 2005: 766-777
[7] Zhao X, Xiao C, Lin X M, et al. Efficient Graph Similarity Joins with Edit Distance Constraints // Proc of the 28th IEEE International Conference on Data Engineering. Washington, USA, 2012: 834-845
[8] Sanfeliu A, Fu K S. A Distance Measure between Attributed Relational Graphs for Pattern Recognition. IEEE Trans on Systems, Man and Cybernetics, 1983, SMC-13(3): 353-362
[9] Bunke H, Allermann G. Inexact Graph Matching for Structural Pattern Recognition. Pattern Recognition Letters, 1983, 1(4): 245-253
[10] Zeng Z P, Tung A K H, Wang J Y, et al. Comparing Stars: On Approximating Graph Edit Distance // Proc the 35th International Conference on Very Large Data Bases. Lyon, France, 2009: 25-36
[11] Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008, 51(1): 107-113
[12] Deng D, Li G L, Hao S, et al. MassJoin: A MapReduce-Based Method for Scalable String Similarity Joins // Proc of the 30th IEEE International Conference on Data Engineering. Chicago, USA, 2014: 340-351
[13] Bloom B H. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 1970, 13(7): 422-426
[14] Broder A, Mitzenmacher M. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 2004, 1(4): 485-509
[15] Koutris P. Bloom Filters in Distributed Query Execution [EB/OL]. [2014-01-23]. http://courses.cs.washington.edu/courses/cse544/11wi/projects/koutris.pdf
[16] Cohen S, Matias Y. Spectral Bloom Filters // Proc of the ACM SIGMOD International Conference on Management of Data. San Diego, USA, 2003: 241-252
[17] Lin X M, Wang W. Set and String Similarity Queries: A Survey. Chinese Journal of Computers, 2011, 34(10): 1853-1862 (in Chinese)
(林学民, 王炜.集合和字符串的相似度查询.计算机学报, 2011, 34(10): 1853-1862
[18] Pang J, Gu Y, Xu J, et al. Research Advance on Similarity Join Queries. Journal of Frontiers of Computer Science and Technology, 2013, 7(1): 1-13 (in Chinese)
(庞俊,谷峪,许嘉,等.相似性连接查询技术研究进展.计算机科学与探索, 2013, 7(1): 1-13)
[19] Wang G, Wang B, Yang X C, et al. Efficiently Indexing Large Sparse Graphs for Similarity Search. IEEE Trans on Knowledge and Data Engineering, 2012, 24(3): 440-451
[20] Cohen J. Graph Twiddling in a MapReduce World. Computing in Science & Engineering, 2009, 11(4): 29-41
[21] Lattanzi S, Moseley B, Suri S, et al. Filtering: A Method for Solving Graph Problems in MapReduce // Proc of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures. San Jose, USA, 2011: 85-94
[22] Suri S, Vassilvitskii S. Counting Triangles and the Curse of the Last Reducer // Proc of the 20th International Conference on World Wide Web. Hyderabad, India, 2011: 607-614
[23] Kang U, Tsourakakis C E, Appel A P, et al. HADI: Mining Radii of Large Graphs. ACM Transactions on Knowledge Discovery from Data, 2011, 5(2): 8.1-8.24
[24] Bahmani B, Chakrabarti K, Xin D. Fast Personalized Pagerank on MapRreduce // Proc of the ACM SIGMOD International Conference on Management of data. Athens, Greece, 2011: 973-984
[25] Kang U, Tong H H, Sun J M, et al. Gbase: A Scalable and General Graph Management System // Proc of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, USA, 2011: 1091-1099
[26] Kang U, Tsourakakis C E, Faloutsos C. Pegasus: A Peta-Scale Graph Mining System Implementation and Observations // Proc of the 9th IEEE International Conference on Data Mining. Miami, USA, 2009: 229-238
[27] Bahmani B, Kumar R, Vassilvitskii S. Densest Subgraph in Streaming and MapReduce. Proceedings of the VLDB Endowment, 2012, 5(5): 454-465
[28] Afrati F N, Fotakis D, Ullman J D. Enumerating Subgraph Instances Using Map-Reduce // Proc of the 29th IEEE International Conference on Data Engineering. Brisbane, Australia, 2013: 62-73